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Capture-recapture methods, largely developed in ecology, are now commonly used in epidemiology to adjust for 
incomplete registries and to estimate the size of difficult-to-reach populations such as problem drug users. Over- 
lapping lists of individuals in the target population, taken from administrative data sources, are considered analo- 
gous to overlapping "captures" of animals. Log-linear models, incorporating interaction terms to account for 
dependencies between sources, are used to predict the number of unobserved individuals and, hence, the total 
population size. A standard assumption to ensure parameter identifiability is that the highest-order interaction 
term is 0. We demonstrate that, when individuals are referred directly between sources, this assumption will often 
be violated, and the standard modeling approach may lead to seriously biased estimates. We refer to such individ- 
uals as having been "precaptured," rather than truly recaptured. Although sometimes an alternative identifiable log- 
linear model could accommodate the referral structure, this will not always be the case. Further, multiple plausible 
models may fit the data equally well but provide widely varying estimates of the population size. We demonstrate an 
alternative modeling approach, based on an interpretable parameterization and driven by careful consideration of 
the relationships between the sources, and we make recommendations for capture-recapture in practice. 

bias; log-linear models; model selection; parameter identifiability; problem drug use; prevalence estimation 



Abbreviations: AIC, Akaike information criterion; CRC, capture-recapture; CI, confidence interval; PDU, problem drug user. 



Capture-recapture (CRC) methods are widely used in epi- 
demiology to adjust for incomplete ascertainment by disease 
registries and to estimate the size of elusive populations (1^1). 
A CRC data set consists of overlapping lists of individuals in 
the target population taken from administrative data sources. 
This is considered akin to multiple captures and recaptures in 
animal abundance studies (5, 6). The observed overlaps are 
used to estimate the size of the unobserved population and, 
hence, the total population size. CRC is particularly useful 
for providing prevalence estimates and denominators for 
populations that are underestimated by standard sampling 
methods, for example, problem drug users (PDUs) (7-20), 
commercial sex workers, the homeless, and individuals with 
subclinical disease (21-25). In Europe, CRC is the recom- 
mended method for estimating the size of PDU populations 



(26) and, in the United Kingdom, CRC estimates are used to 
monitor the effectiveness of drag policy (27, 28). 

Generally, log-linear regressions are used to model the 
observed data, assuming a multinomial or (equivalently) 
Poisson likelihood (29). Aspects requiring careful consider- 
ation include the possibility of dependencies between data 
sources, whereby the appearance of an individual in 1 source 
affects his or her probability of appearance in another, as well 
as heterogeneities in capture probabilities, such that specific 
subgroups of the population are more or less likely to be re- 
corded. These 2 phenomena are closely related (2, 30), but in 
some cases the latter may be accounted for by model stratifi- 
cation or the incorporation of covariates in the regression 
model (31, 32). In this paper, we focus instead on issues re- 
lating to dependencies between data sources that cannot be 
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accounted for by using measured covariates. An early sug- 
gestion to merge dependent data sources (33) has now been 
subsumed within the more general approach of incorporation 
of source-by-source interaction terms in a log-linear model 
(2, 29, 34). 

We will demonstrate that this standard approach is often in- 
adequate in the presence of referrals of individuals between 
sources, a mechanism we believe to be very common in 
CRC studies of human populations. When a proportion of in- 
dividuals "captured" in 1 source is referred to another, these 
individuals can be thought of as having been "precaptured" 
rather than truly "recaptured." Because they have simply 
been passed from 1 source to another, they cannot be informa- 
tive about the population size. Whereas more general source 
dependencies in epidemiologic CRC have their counterpart 
in ecological CRC (broadly comparable to "trap fascination" 
or "trap avoidance" (4)), direct referrals would be analogous to 
catching a sample of fish in a net, setting some proportion of 
these fish aside, and letting them contribute to the second 
sample. 

The critical issues are those of model selection and param- 
eter identification. Often, referrals will not be the only mech- 
anism creating between-source dependencies, so that the full 
dependency structure might be quite complex. In a CRC anal- 
ysis of S data sources, inclusion of all possible interaction 
terms in a log-linear model would require 2 s parameters — 
1 more than the number of observations (29). It is therefore 
necessary to enforce some constraint, such as assuming that 
at least 1 of the possible interaction terms is 0. The usual ap- 
proach is to consider only models with no S-way interaction 
(2). This is motivated by the view that higher-order interac- 
tions are unlikely in the absence of their lower-order relatives, 
sometimes referred to as the "hierarchy principle" (4, 29). 
However, Cormack (35) describes the no S-way interac- 
tion assumption as an "act of faith." Variable catchability 
(heterogeneity) and migration are 2 phenomena that have 
been recognized to induce interactions of an order higher 
than 1 (1, 36). 

Using the 3-source case as an example, we will show that 
between-source referrals will often correspond to a 3-way 
interaction term in the log-linear model. Sometimes an alter- 
native saturated log-linear model would adequately accom- 
modate the referral structure. However, alternative saturated 
models can lead to widely different estimates of the popula- 
tion size and, because all fit the data perfectly by definition, it 
is impossible to choose between them without expert knowl- 
edge of the interrelationships between the sources. Further, in 
slightly more complex scenarios, we will show that the true 
model structure cannot be represented by any identifiable 
log-linear model. 

MOTIVATING EXAMPLE: PREVALENCE OF PROBLEM 
DRUG USE IN ENGLAND 

Hay et al. (27) estimated the number of PDUs (i.e., users 
of opiates and/or crack cocaine) aged 15-64 years in England 
in the 1-year period from April 2009 to March 2010. Lists 
of PDUs identified in each of the following 4 administra- 
tive data sources during that time period were obtained and 
linked: "treatment in the community," "arrest for possession," 



Table 1. Observed Numbers of Problem Drug Users, England, 
2009-2010 



Treatment 
(Source 1 ) a 


Arrest Probation 
(Source 2) a (Source 3) a 


Notation for 
Observed 

No. of 
Individuals 


\j user vcu 
Data 
(Total = 192,551) 


Yes 


Yes 


Yes 


*111 


1,868 


Yes 


Yes 


No 


*110 


3,123 


Yes 


No 


Yes 


*101 


21,633 


Yes 


No 


No 


*100 


147,049 


No 


Yes 


Yes 


*011 


349 


No 


Yes 


No 


*010 


4,491 


No 


No 


Yes 


*001 


14,038 


No 


No 


No 


*000 


Unobserved 



a "Yes" denotes presence, and "no" denotes absence from the data 
source. 



"probation," and "prison." For ease of exposition, we discard 
the prison source but return to discuss this later. We also aggre- 
gate over the available covariate values (age group, sex, and 
geographical area). For a description of the data sources and 
the process used for matching individuals, see the article by 
Hay et al. (37). 

The aggregated data are shown in Table 1 . We refer to the 
sources as SI, S2, and S3, and we denote the observed num- 
ber of individuals in each cell as x^, where the indices indi- 
cate presence (subscript = 1) or absence (subscript = 0) in 
each of the 3 sources. For example, xioi denotes the number 
of individuals observed in SI and S3 but not in S2. On occa- 
sion, we will write Su = 1 and Su = 0 to represent presence or 
absence in Source u (u= 1,2,3). 

Clearly, we would expect some dependencies among 
these 3 data sources. In particular, we would anticipate a pos- 
itive dependency between arrest and probation, because we 
would expect individuals committing 1 crime to be more 
likely to commit another. A negative dependency between 
treatment and the 2 criminal justice system sources is also 
possible, although these might be independent. The depen- 
dency structure is further complicated by the presence of re- 
ferrals between sources. Specifically, individuals arrested for 
drug possession may be put on probation, and referrals to 
drug treatment are made from each of the criminal justice sys- 
tem sources. 

Using the standard log-linear modeling framework, if we 
adopt the no S-way interaction rule, there are only 8 possible 
models to consider. These are indicated in Table 2, where we 
denote, for example, an SI by S2 interaction by "SlxS2". 
Model 1, which we refer to as the base case or simply 
"base," assumes independence of the 3 sources. Alterna- 
tively, we can include 1 (models 2-4), 2 (models 5-7), or 
all 3 (model 8) source-by-source interaction terms. Model 8 
is a saturated model, because it involves fitting 7 parameters 
to the 7 observed data points. In addition, we present results 
from 3 alternative saturated log-linear models (models 9-1 1), 
which violate the no S-way interaction rule and therefore are 
not usually considered in CRC applications. Each involves 
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Table 2. Results From Poisson Log-Linear Regression Analyses of the Problem Drug User Data, England, 2009-2010 





Log-Linear Model 


Estimated Missing Cell Count 3 


Estimated Total Population Size 3 " 




Model Fit c 




Model No. 


Description 




95% CI 


N 


95% CI 


AIC 


G 2 


df 


1 


Base 


100,000 


98,200, 102,000 


292,600 


290,700, 294,500 


2,501 


2,419 


3 


2 


Base + S1xS2 


91,000 


89,100, 93,000 


283,600 


281 ,700, 285,600 


2,128 


2,043 


2 


3 


Base + S1xS3 


153,700 


147,700, 159,900 


346,200 


340,200, 352,500 


1,948 


1,864 


2 


4 


Base + S2xS3 


1 nd ^nn 


1 vo ^nn 1 nfi ^nn 


9Qfi Rnn 


9Q4 Ann Qnn 


1,779 


1,694 


2 


5 


Base + S1xS2 + S1xS3 


180,600 


161,800, 201,700 


373,200 


354,400, 394,200 


1 ,g40 


1 ,854 




6 


Base + S1xS2 + S2xS3 


95,400 


93,400, 97,500 


288,000 


285,900, 290,100 


1,474 


1,388 


1 


7 


Base + S1xS3 + S2xS3 


211,500 


202,000, 221,400 


404,000 


394,500, 414,000 


645 


558 


1 


8 


Base + S1 xS2 + S 1 xS3 + S2xS3 


734,500 


648,200, 832,300 


927,000 


840,700, 1 ,024,800 


88 


0 


0 


9 


Base + S 1 xS2 + S 1 xS3 + S 1 xS2xS3 


180,600 


161,800, 201,700 


373,200 


354,400, 394,200 


88 


0 


0 


10 


Base + S 1 xS2 + S2xS3 + S 1 xS2xS3 


95,400 


93,400, 97,500 


288,000 


285,900, 290,100 


88 


0 


0 


11 


Base + S1 xS3 + S2xS3 + S1 xS2xS3 


211,500 


202,000, 221,400 


404,000 


394,500; 414,000 


88 


0 


0 



Abbreviations: AIC, Akaike information criterion; CI, confidence interval. 
a All population size estimates have been rounded to the nearest 100. 

b This is simply the estimated missing cell count plus the observed number of 192,551 (Table 1). 

c G 2 = the likelihood ratio test statistic, displayed with the degrees of freedom for the corresponding x 2 test (P values are omitted because, for all 
nonsaturated models, P< 0.001). 



excluding 1 of the 2-way interaction terms in favor of accom- 
modating a 3-way interaction. 

As saturated models, models 8-1 1 will all fit the data per- 
fectly, but they will provide different extrapolations to the 
missing cell. It can be shown that the maximum likelihood 
estimators for the missing cell based on models 9-11 are in 
fact identical to those from models 5-7. For example, models 
5 and 9 both estimate x 0 oo by Xoio^ooi/^on. This arises from 
both models assuming independence between S2 and S3 in 
the SI = 0 cells. In estimating xqoo, all 4 SI = 1 cells are ig- 
nored. However, model 9, unlike model 5, does not enforce 
the assumption of independence between S2 and S3 among 
the S 1 = 1 cells, and hence will result in better model fit 
(at the cost of an additional parameter). 

Model selection in epidemiologic CRC is often carried out 
by minimizing some information criterion such as the Akaike 
information criterion (AIC) (1, 29). Alternatively, the sim- 
plest model is selected that fits the data according to a likeli- 
hood ratio test statistic G 2 , referred to a % 2 distribution (1, 29). 
In Table 2 we show the AIC and G statistics for each of the 
1 1 models, in addition to estimates of the population size. Es- 
timates from models incorporating the available covariates 
were similar (not shown). We adopt the most common defi- 
nition of the AIC (38, 39), but note that an alternative version 
is popular in the CRC literature. The 2 definitions differ only 
in terms of a constant; because AICs should be interpreted 
only relative to each other rather than in absolute terms, the 
distinction is not important. Regression models were fitted 
using Stata, version 13, statistical software (StataCorp LP, 
College Station, Texas), assuming a Poisson likelihood. We 
coded all models such that the presence or absence in a source 
was indicated by 1 or 0, respectively. Other parameterizations 
can be used, but care must be taken to interpret the interaction 
models accordingly. 



Table 2 shows that the 4 saturated models jointly have by 
far the lowest AIC. The G 2 statistic is seen to be very large 
relative to its degrees of freedom for all other models, indi- 
cating that none of these provides an adequate fit to the 
data. The 4 saturated models are seen to imply widely differ- 
ent estimates of the total population size N, ranging from 
288,000 to 927,000 and with 95% confidence intervals not 
overlapping. 

Following the no S-way interaction rule, models 9-1 1 
would not be considered, so model 8 would be selected. How- 
ever, this is seen to provide an estimate of 927,000 (95% con- 
fidence interval (CI): 840,700, 1,024,800) PDUs in England, 
which is more than 3 times the magnitude of the previously 
published estimate (27) and would imply that 2.7% of adults 
aged 15-64 years were heroin or crack cocaine users. This is 
unsupported by other evidence (such as surveys of the propor- 
tion of people in treatment and the number of drug-related 
deaths) and is considered infeasible by experts. Perhaps the 
prevalence estimate from 1 of the other saturated models is 
more accurate, but there is no obvious reason to select 1 of 
these over the others. Moreover, we will demonstrate below 
that it is perfectly plausible that none of these 4 saturated mod- 
els provides an accurate estimate of the population size, even in 
the absence of realistic additional complications, such as var- 
iable catchability or errors made in the matching process. 

If the "prison" source is also included then, following 
the hierarchy principle, there are 1 13 possible log-linear models 
(2). The best fitting of these was found to be the saturated 
model with six 2-way and four 3-way interactions between 
sources. This model gave a still more implausible estimate 
of 1,614,000 (95% CI: 1,167,900, 2,265,000) PDUs in 
England. Again, alternative saturated models provided very 
different estimates. The methodological problems are there- 
fore clearly not confined to 3-source models. 
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BIAS DUE TO BETWEEN-SOURCE REFERRAL 

To assess whether models incorporating referrals between 
sources can be fitted within the standard log-linear framework, 
we compare the form of the expected cell counts under various 
scenarios, focusing again on the 3-source case for simplicity. 
We denote the conditional probability of an individual appear- 
ing in source u given presence or absence in source v, before 
any referrals take place, by p u \ Sv _ i and p u \ Sv _ 0 , respectively. 
When/7„| Sy= i =p„, Sv = 0 , we simply write p„. 

We consider 3 simple referral scenarios as examples. The 
expected cell counts under each of these, before and after re- 
ferrals, are shown in Table 3. In the first 2 scenarios, there are 



assumed to be no "standard" interactions (i.e., if it were not 
for the referrals, then the data sources would be independent). 
In referral scenario 1 , a proportion q 1 3 of individuals in S 1 are 
referred to S3. This occurs independently of whether they 
were seen in S2 or would otherwise have been seen in S3 any- 
way. From the postreferral expected cell counts, it can be seen 
that this is equivalent to a standard interaction between S 1 
and S3, with p 3 | S i=i =Pi + #13(1 -P3) and p 3 isi=o=/'3- 
Hence, this very simple scenario corresponds to model 3 in 
Table 2. 

However if, in addition, a proportion q 2 3 of individuals 
were independently referred from S2 into S3, then the ex- 
pected cell counts would be as shown in the second part of 



Table 3. Expected Cell Counts in a 3-Source Capture-Recapture Under Hypothetical Referral Scenarios 



Source Combination" 



Expected Cell Counts 



S1 


S2 


S3 


Prereferral 


Postreferral 








Referral Scenario 1 b 


1 


1 


1 


/VP1P2P? 


N Pl p 2 [p 3 + <7i 3 (1 - Ps)] 


1 


1 


0 


WpiflzO - pa) 


A/PiPz{1 -[P3 + <7is(1 - Ps)]} 


1 


0 


"I 


Wpi(1 -ft>)P3 


Wpi(1 -p 2 )[P? + q , is(1 -Pa)] 


1 


0 


0 


Wpi(1 -P2XI - Pa) 


W P i(1 -P2KI -[P3 + £7i3(1 -Ps)]} 


0 


1 


1 


W(1 -p 1 )p 2 P3 


M 1 -Pi)P2Ps 


0 


1 


0 


W(1 -p 1 )p 2 (1 -Ps) 


N(1 -Pi)p 2 (1 - Ps) 


0 


0 


1 


W(1 -Pi)(1 -pa)p3 


NV-p^-pJpa 


0 


0 


0 


M1-Pi)(1-fc)(1-ft) 


N(1-p 1 )(1-p 2 )(1-P3) 








Referral Scenario 2 C 


1 


1 


1 


/VP1P2P? 


Np-,P2[P3 + {qi3 + - <713<72S) (1 - Ps)] 


1 


1 


0 


WpiP2(1 -p 3 ) 


NP1P2V - [Ps + (9is + Q23 - QistoXI - Ps)]} 


1 


0 


1 


Npi(1 -P2)P3 


Np^l -p 2 )[Ps + qi3(1 -A>)] 


1 


0 


0 


Npi(1 -P2XI -p 3 ) 


Np!(1 -p 2 ){1 -[P3 + t7is(1 -Ps)]} 


0 


1 


1 


N(1 -Pl)P2P3 


NO -Pl)P2[P? + Q23(1 - Ps)] 


0 


1 


0 


N(1 -Pi)A>(1 - Ps) 


- Pl)P2{1 - [PS + <723(1 - Ps)]} 


0 


0 


1 


W(1 -Pl)(1 -Pa)P3 


Nfl-^XI-paJpa 


0 


0 


0 


/V(1- P1 )(1-P 2 )(1-P3) 


/VO-p^d-ftXI-pa) 








Referral Scenario 3 d 


1 


1 


1 


WPlP2IS1=lP3IS2=1 


W[PlP2IS1=1 + <73l(1 -Pl)P2IS1=o]P3IS2=1 


1 


1 


0 


Wp lP 2|S1=l(1 -P3IS2=l) 


/VPlP2IS1=l(1 -P3IS2=l) 


1 


0 


1 


N P l(1 -P2IS1=l)P3IS2=0 


W[Pl(1 -P2IS1=l) + Q3l(1 -Pl)(1 -P2IS1=o)]P3IS2=0 


1 


0 


0 


A/Pl(1 -P2IS1=l)0 -P3IS2=o) 


Np^l -p 2 |S1=l)(1 -P3IS2=o) 


0 


1 


1 


-Pl)P2IS1=0P3IS2=1 


A/(1 -Pl)(1 - (73l)P2IS1=0P3IS2=1 


0 


1 


0 


- P l)P2IS1=o(1 -P3IS2=l) 


A/(1 -Pl)p 2 |S1=o(1 -P3IS2=l) 


0 


0 


1 


W(1 -Pl)(1 -P2IS1=o)P3IS2=0 


A/(1 -Pl)(1 -q 3 l)(1 -P2IS1=o)P3IS2=0 


0 


0 


0 


W(1 -Pl)(1 -P2IS1=o)(1 -P3IS2=o) 


A/(1 -Pl)(1 -P2IS1=o)(1 -PSIS2=0) 



a For example, the row with S1 = 1 , S2 = 1 , and S3 = 0 corresponds to the expected number of individuals observed 
in source 1 and source 2 but not in source 3 (the expectation of x 110 ). 

b No standard interactions, but referral of a proportion q 13 of individuals from source 1 into source 3. (Prereferral: 
base. Postreferral: base + referrals from S1 into S3). 

c No standard interactions, but referral of a proportion q 13 of individuals from source 1 into source 3 and, 
independently, a proportion q 23 of individuals from source 2 into source 3. (Prereferral: base. Postreferral: 
base + referrals from S1 into S3 + referrals from S2 into S3). 

d Source 1 by source 2 and source 2 by source 3 interactions, plus referral of a proportion q 13 of individuals from source 
1 into source 3. (Prereferral: base + S1xS2 + S2xS3. Postreferral: base + S1xS2 + S2xS3 + referrals from S1 into S3). 
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Table 3 (referral scenario 2). As we would expect, the prob- 
ability of an individual appearing in S3 now depends on 
appearance or otherwise in each of SI and S2. However, 
we show in Appendix 1 that the corresponding log-linear 
model is not model 7 (base + SlxS3 + S2xS3) but actually 
model 11 (base + SlxS3 + S2xS3 + SlxS2xS3), violating 
the hierarchy principle. Specifically, this means that the refer- 
rals have induced a relationship between sources 1 and 2 
among individuals observed in source 3, but not among indi- 
viduals not seen in source 3. Clearly, even a simple scenario 
involving referrals can correspond to a 3-way interaction term 
in the log-linear model, so that none of the 8 standard models 
is appropriate. 

In the final section of Table 3, we consider the scenario of 2 
standard interactions (SlxS2 and S2xS3) followed by referral 
of a proportion q 3l of individuals from S3 into SI. We show 
in Appendix 1 that the corresponding log-linear model 
requires both a 3-way interaction term and also an SlxS3 in- 
teraction term. With only 7 observed data points, it is not pos- 
sible to fit the appropriate log-linear model, which would 
need to include 4 interaction terms, and therefore 8 parame- 
ters in total. 

In summary, in a 3-source CRC analysis, some simple hy- 
pothetical referral scenarios can be parameterized as log- 
linear models with only 2-way interaction terms, whereas 
others induce a 3-way interaction. Sometimes, for example 
in our referral scenario 3, it will not be possible to accommo- 
date the correct data structure using any identifiable log- 
linear model. 

ARTIFICIAL 3-SOURCE DATA SETS 

We now consider 2 artificial data sets in order to demon- 
strate that standard methods can be very misleading. The 2 
data sets, displayed in Table 4, were simulated from multi- 
nomial models under referral scenarios 2 and 3. Refer to 
the table's legend for full details. In both cases, we treat the 
cell count x 0 oo as missing and estimate it using each of the 1 1 
log-linear models. Population size estimates and model fit 
statistics are shown in Tables 5 and 6. 

As with our real data, we see that the 4 saturated models 
(models 8-11) jointly have by far the lowest AIC statistics. 
All other models have very large G 2 statistics relative to 
their degrees of freedom, indicating inadequate fit to the 
data. Once again, of the saturated models, the default choice 
would be model 8 because of the no S-way interaction rule. 
For the first data set, the resulting estimate of the missing cell 
is 60,400 (95% CI: 56,900, 64,200), with the 95% confidence 
interval not incorporating the true value of 86,600 (Table 4). 
For the second data set, the estimate of the missing cell is 
163,700 (95% CI: 153,800, 174,100), which is much greater 
than the true value (Table 4) of only 57,100. For each data set, 
the alternative saturated models generate widely differing es- 
timates of the hidden population size. 

For artificial data set 1, we know from the section "Bias 
due to between-source referral" that the appropriate log-linear 
model is model 1 1 , and we see in Table 5 that this does in- 
deed recover the true value accurately, estimating the missing 
cell count to be 85,500 (95% CI: 82,000, 89,100). However, 
without expert knowledge of the relationships among the 



Table 4. Two Artificial Data Sets, Simulated From Multinomial 
Models With Referrals 



Source Combinations 9 




Artificial 
Data Set 1 b 


Artificial 


S1 


S2 


S3 


Data Set 2" 


1 


1 


1 


4,397 


8,155 


1 


1 


0 


3,637 


1,616 


1 


0 


1 


25,613 


18,209 


1 


0 


0 


46,312 


61,087 


0 


1 


1 


5,226 


22,964 


0 


1 


0 


6,714 


19,430 


0 


0 


1 


21,506 


11,425 


0 


0 


0 


86,595 


57,114 




Total observed 




113,405 


142,886 




Total population size 


200,000 


200,000 


a For example, the row with S1 = 1 , S2 = 1 , and S3 = 0 shows x-, w , 
the number of individuals observed in source 1 and source 2 but not in 



source 3. 

b Simulated from referral scenario 2 in Table 3 (no standard 
interactions, but referral of a proportion q-, 3 of individuals from 
source 1 into source 3 and, independently, a proportion q 23 of 
individuals from source 2 into source 3), assuming the following 
parameter values: N= 200,000, pi =0.4, P2 = 0.1, p 3 = 0.2, g 13 = 0.2, 
and 923 = 0.3 {N= total population size, and Pu = prereferral probability 
of appearing in source u). 

c Simulated from referral scenario 3 in Table 3 (standard source 1 by 
source 2 and source 2 by source 3 interactions, plus referral of a 
proportion <7 13 of individuals from source 1 into source 3), assuming 
the following parameter values: N= 200,000, p-i =0.4, P2isi=o = 0.4, 
P2isi=i =0.05, P3is2=o = 0.2, P3is2=i = 0.6, and Qi 3 = 0.2 (A/=total pop- 
ulation size, and p u \ Sv =-\ and p U | Sv =o are the prereferral probabilities 
of appearing in source u, given presence or absence in source v). 



3 data sources and the mathematical analysis above, it would 
be impossible to choose among models 8-1 1 if all were con- 
sidered. As noted above, model 7 also gives the correct max- 
imum likelihood estimate. However, this model fits the 
observed data poorly and, as such, would not be selected 
by analysts in practice. 

We showed in the section titled "Bias due to between-source 
referral" and in Appendix 1 that none of the 1 1 log-linear mod- 
els corresponds to the truth for the model used to generate ar- 
tificial data set 2. In fact, none of the 4 saturated models in 
Table 6 produces a 95% confidence interval that includes the 
true value. If, however, we knew the underlying structure of 
the data, then it would be possible to obtain an unbiased esti- 
mate by formulating a model directly in terms of meaningful 
parameters (probabilities and proportions) of the type in 
Table 3. As an example, in Appendix 2 we present WinBUGS 
(40) code for fitting the correct model to artificial data set 2. 
Here we assume uninformative uniform(0, 1 ) prior distributions 
for each of the p and q parameters and a vague half-normal dis- 
tribution for the log of the total population size. Fitting this 
model, the resulting estimate of the unseen population size 
was 56,900 (95% credible interval: 55,000, 58,800), implying 
a total population size of 199,800 (95% credible interval: 
197,900, 201,700), which is extremely close to the true 
value of 200,000. Note that, because this "correct" model is 
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Table 5. Results From Poisson Log-Linear Regression Analyses Applied to Artificial Data Set 1 





Log-Linear Model 


Estimated Missing Cell Count"'" 


Estimated Total Population Size ac 




Model Fit" 




Model No. 


Description 




95% CI 


W 


95% CI 


AIC 


G 2 


df 


1 


Base 


41,000 


40,300, 41 ,700 


154,400 


153,700, 155,100 


2,825 


2,739 


3 


2 


Base + S1xS2 


34,600 


33,900, 35,200 


148,000 


147,300, 148,600 


1,345 


1,258 


2 


3 


Base + S1xS3 


47,300 


45,900, 48,800 


160,700 


159,300, 162,200 


2,705 


2,617 


2 


4 


Base + S2xS3 


46,000 


45,200, 46,900 


159,400 


158,600, 160,300 


1,446 


1,358 


2 


5 


Base + S1xS2 + S1xS3 


27,600 


26,600, 28,700 


141,000 


140,000, 142,100 


1,179 


1,089 


1 


6 


Base + S1xS2 + S2xS3 


38,900 


38,100, 39,700 


152,300 


151,500, 153,100 


321 


231 


1 


7 


Base + S1xS3 + S2xS3 


85,500 


82,000, 89,100 


198,900 


195,400, 202,500 


330 


241 


1 


8 


Base + S1 xS2 + S1 xS3 + S2xS3 


60,400 


56,900, 64,200 


173,800 


170,300, 177,600 


92 


0 


0 


9 


Base + S1 xS2 + S1 xS3 + S1 xS2xS3 


27,600 


26,600, 28,700 


141,000 


140,000, 142,100 


92 


0 


0 


10 


Base + S1 xS2 + S2xS3 + S1xS2xS3 


38,900 


38,100, 39,700 


152,300 


151,500, 153,100 


92 


0 


0 


11 


Base + S1 xS3 + S2xS3 + S1 xS2xS3 


85,500 


82,000, 89,100 


198,900 


195,400, 202,500 


92 


0 


0 



Abbreviations: AIC, Akaike information criterion; CI, confidence interval. 
a All population size estimates have been rounded to the nearest 100. 

b Estimates of the missing cell count should be compared with the true value of 86,595 (Table 4). 

c This is simply the estimated missing cell count plus the observed number of 1 1 3,405 (Table 4). Estimates should be compared with the true 
value of 200,000. 

d G 2 = the likelihood ratio test statistic, displayed with the degrees of freedom for the corresponding % 2 test (P values are omitted because, for all 
nonsaturated models, P< 0.001). 



1 of many possible saturated models (others including models 
8-1 1 in Table 6), we would not possibly know to select it with- 
out external knowledge of the data structure. 

DISCUSSION 

We have demonstrated that, in the presence of referral mech- 
anisms between sources, the standard log-linear modeling 



approach to CRC is too restrictive and can lead to grossly mis- 
leading prevalence estimates. The absolute bias will be larg- 
est when the proportion of the population observed is low. 
Model averaging (41) is not an appropriate solution if none 
of the models considered is "correct" and/or if only saturated 
models fit the data, because potentially widely varying esti- 
mates from different saturated models would be assigned 
equal weight. 



Table 6. Results From Poisson Log-Linear Regression Analyses Applied to Artificial Data Set 2 





Log-Linear Model 


Estimated Missing Cell Count 9 '" 


Estimated Total Population Size a c 


Model Fit" 




Model No. 


Description 




95% CI 


N 


95% CI 


AIC 


G 2 


df 


1 


Base 


57,000 


56,200, 57,800 


199,900 


199,100, 200,700 


51,619 


51,532 


3 


2 


Base + S1xS2 


19,000 


18,600, 19,400 


161,900 


161,500, 162,300 


21,365 


21 ,276 


2 


3 


Base + S1xS3 


53,800 


52,800, 54,900 


196,700 


195,700, 197,800 


51,544 


51 ,454 


2 


4 


Base + S2xS3 


117,500 


115,600, 119,500 


260,400 


258,500, 262,300 


18,469 


18,380 


2 


5 


Base + S1xS2 + S1xS3 


9,700 


9,400, 9,900 


152,600 


152,300, 152,800 


14,078 


13,986 


1 


6 


Base + S1xS2 + S2xS3 


38,300 


37,400, 39,300 


181,200 


180,300, 182,200 


3,210 


3,119 


1 


7 


Base + S1xS3 + S2xS3 


734,500 


697,700, 773,200 


877,400 


840,600, 916,100 


7,936 


7,845 


1 


8 


Base + S1 xS2 + S1 xS3 + S2xS3 


163,700 


153,800, 174,100 


306,500 


296,700, 317,000 


93 


0 


0 


9 


Base + S1 xS2 + S1xS3 + S1 xS2xS3 


9,700 


9,400, 9,900 


152,600 


152,300, 152,800 


93 


0 


0 


10 


Base + S1 xS2 + S2xS3 + S1 xS2xS3 


38,300 


37,400, 39,300 


181,200 


180,300, 182,200 


93 


0 


0 


11 


Base + S1 xS3 + S2xS3 + S1 xS2xS3 


734,500 


697,700, 773,200 


877,400 


840,600, 916,100 


93 


0 


0 



Abbreviations: AIC, Akaike information criterion; CI, confidence interval. 
a All population size estimates have been rounded to the nearest 1 00. 

b Estimates of the missing cell count should be compared with the true value of 57,1 14 (Table 4). 

c This is simply the estimated missing cell count plus the observed number of 142,886 (Table 4). Estimates should be compared with the true 
value of 200,000. 

d G 2 = the likelihood ratio test statistic, displayed with the degrees of freedom for the corresponding % 2 test (P values are omitted because, for all 
nonsaturated models, P< 0.001). 
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As early as 1968, Wittes and Sidel (42) noted cause for 
concern if it is suspected that information from 1 CRC source 
has been obtained directly from another. This was before the 
advent of log-linear models. Inclusion of interaction terms in 
a log-linear model is now the standard methodology for ac- 
counting for source dependencies (2, 29), with the highest- 
order interaction term assumed to equal 0. We are not the 
first to point out that this approach is not infallible (35, 36). 
In particular, Hook and Regal (2) have suggested a need for 
caution when the standard saturated model is selected by 
some criterion such as the AIC. However, the fact that alter- 
native saturated models can provide such dramatically differ- 
ent estimates of the population size does not appear to have 
been recognized. For example, Regal and Hook (36) demon- 
strated an application with likely violation of the "no S-way 
interaction" rule, but the inferred interaction was small, so 
that ignoring it was found to have little effect on the estimate 
of the population size. 

We highlight precapture as a particular source of concern 
because we believe that many CRC studies in epidemiology 
are likely to involve referrals between sources and might 
therefore be subject to bias. Policies that seek to move drag 
users identified in the criminal justice system into treatment 
(43) complicate attempts to estimate the prevalence of drag 
addiction using CRC. More generally, applications of CRC 
in epidemiology frequently combine reports from clinical 
and laboratory sources, from primary care and secondary 
care settings, or from different medical specialists (44). It 
seems almost inevitable that some referrals will occur be- 
tween such sources. For example, Van Hest et al. (45) used 
infectious disease notifications, laboratory reports, and data 
on hospitalizations to estimate tuberculosis prevalence, but 
they noted that "[tuberculosis] services in England are orga- 
nized around close collaboration between clinicians, microbi- 
ologists and public health professionals." Similarly, CRC has 
been used to estimate the incidence of congenital cataract by 
combining information from pediatricians and pediatric oph- 
thalmologists (46). However, cataract is often a manifestation 
of a wider dysmorphic syndrome. Good practice dictates that 
children who present first to pediatricians will be referred to 
ophthalmologists, whereas those presenting first to ophthal- 
mologists are referred to pediatricians to be investigated for 
other problems associated with these syndromes (47). 

Again, it seems likely that there are referrals among hospi- 
talizations, outpatient visits, and nursing home admissions, 
which are 3 of the sources used by Turabelidze et al. (48) 
in estimating multiple sclerosis prevalence. Papoz et al. 
(49) noted that a positive dependency between a list of pa- 
tients from physicians' practices and a list of patients given 
prescriptions is "hardly surprising, since the physician is 
the prescribes " The tendency for 2 malaria reporting systems 
to alert each other was observed by Cathcart et al. (50), and 
Wittes (32) discusses a potential scenario in which some 
names are actually copied from 1 list to another. 

We have demonstrated that, in certain circumstances, an 
alternative saturated log-linear model might successfully ac- 
count for referrals. Models 9-11 are based on the assumption 
of independence of 2 of the sources in the subset of individ- 
uals not seen in the third, but on dependence in the subset ap- 
pearing in the third. Use of these models requires a priori 



justification of why this might be more plausible for the spe- 
cific data set than the assumption of no 3-way interaction. 
More generally, in agreement with Regal and Hook (36) 
and Cormack (51), we encourage model choice to be guided 
by an a priori in-depth consideration of the likely relation- 
ships among the data sources, guided by discussions with rel- 
evant experts, rather than by blind application of some set of 
statistical criteria. In particular, we urge analysts to explicitly 
consider whether referrals between data sources are likely. 
This might be an additional rule to add to the list of recom- 
mendations of Regal and Hook (52, 53). If no referrals are 
present, then standard log-linear methods can be used, al- 
though it might be helpful to instead write models directly 
in terms of meaningful parameters. This facilitates assess- 
ment of the plausibility of the estimated parameters, to be 
used alongside standard model fit statistics to assess model 
adequacy. 

If referrals are believed to be present, then expected cell 
counts should be formulated in terms of probabilities and pro- 
portions, as in Table 3. We have demonstrated that a model pa- 
rameterized in this way can be fitted using the Bayesian 
software WinBUGS, but we note that it might also be possi- 
ble to fit this model using a maximum likelihood approach 
(36). Our software choice was primarily for pragmatic rea- 
sons: WinBUGS automatically provides a credible interval 
for the population size in addition to a point estimate and 
makes it simple to enforce constraints such that probability 
parameters lie between 0 and 1 . 

For our motivating example, we are not confident that any 
of the alternative saturated log-linear models (Table 2) is 
"correct." For example, the structure assumed by model 9 
allows for referrals from both arrest and probation into treat- 
ment, but it does not allow for the "standard" interactions that 
we also expect between these data sources, nor for referrals 
from arrest into probation. The true expected dependence 
structure is, unfortunately, too complex for all parameters 
to be identifiable without additional information. The best ad- 
vice to those planning a CRC is to avoid this difficult problem 
by careful selection of data sources, such that complex depen- 
dence structures are avoided. If this is impossible, then we en- 
courage the collection of auxiliary information at the point of 
data matching, such as referral source if known, and/or the 
specific dates at which individuals appeared in each source. 
Using this information, it may be possible to quantify the 
probability that those in overlap sets have been referred, 
and hence to "remove" likely precaptures prior to analysis. 

An alternative solution might be to seek external informa- 
tion to inform the referral parameters. Here, a Bayesian ap- 
proach could be particularly useful because it offers scope 
for the incorporation of such external information as informa- 
tive prior distributions. The external information required 
need not inform the referral proportions directly; external 
data can be used to inform a function of parameters within 
the model (54). However, a great degree of care is needed, 
because the information must reflect as closely as possible the 
way in which the specific data were collected, in particular the 
length of the time window. More generally, a Bayesian frame- 
work can be used to incorporate formally other sources of infor- 
mation on prevalence, such as data on drag-related deaths (55). 
A "multiparameter evidence synthesis" framework (56, 57), in 
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which multiple sources of data are incorporated into a single 
coherent model, offers potential for formal assessment of the 
consistency of evidence from different sources. Careful mod- 
eling and accounting for known biases of course remain 
essential. 
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APPENDIX 1 : STATISTICAL DETAILS 

We write L» for the expectation of each cell count x ijk . The 
full log-linear model specification is below, where / is the 
standard indicator function, oc is an intercept term, and 
Pi — P3 are main effects of inclusion in each of the 3 sources. 
Source-by-source interaction terms are denoted by 8 12 , 613, 
and 8 2 3, whereas we denote the 3-way interaction term by y. 

log(Ap) = a + py(/ = 1) + p 2 /(./ = 1) 

+ M*= l)+8 12 /(/= \)I{j= 1) 

+ 8 n I(i = = 1) + 823/(7 = W = 1) 

+ y/(i= !)/(./= !)/(*=!). 
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This full model, with all 8 possible parameters, implies the 
following 8 expected cell counts: 



S1 


S2 


S3 




1 


1 


1 


exp(a + P, + + + §12 + 5,3 + 5 23 + y) 


1 


1 


0 


exp(a+p, +p 2 + 5 12 ) 


1 


0 


1 


exp(a+p, + p 3 + 8 13 ) 


1 


0 


0 


exp(a+ p,) 


0 


1 


1 


exp(a + p 2 + p 3 + 5 23 ) 


0 


1 


0 


exp(a + p 2 ) 


0 


0 


1 


exp(a + p 3 ) 


0 


0 


0 


exp(a) 



Now consider, for example, the cross-product of the ex- 
pected cell counts in the 4 cells representing individuals 
not seen in source 3. 

^lio^ooo _ exp(a + p t + p 2 + 812) exp(ot) _ ^ 
A, 100 Aoio ex P( a + Pi ) ex P( a + P2) 

As such, if there is no source 1 by source 2 interaction pa- 
rameter in the model (812 = 0), then this cross-product must 
equal 1. 

Similarly, because the cross-product in the S3 = 1 cells is 

^111^001 

^101^011 

_ exp(a+ p t + p 2 + P 3 + 812 + 813 + 8 23 + 7123) exp(a+ p 3 ) 
exp(a+ p t + P 3 +813) exp(a+ P 2 + P 3 + 8 23 ) 

_ g 8l2+Y 

then the 3-way cross-product is equal to 

^-111^100^010^001 ^ 
^-011^000^-110^101 

Therefore, if there is no 3-way interaction parameter in the 
model (y = 0), then this 3-way cross-product must equal 1 (4). 



Using the expected cell counts from Table 3, this 3-way 
cross product term (equation 2) for our referral scenario 2 
is seen to be 

^•111^100^010^001 P3 + (<7l3 + 923 — 9l3923j(l — P3) 
^OllA.OOO^lloA.101 [P3 + 913(1 —P3)][P3 + 923(1 -pi)] 

which is equal to 1 only if either g 13 = 0 or q 2 3 = 0. However, 
the cross-product in the S3 = 0 cells (equation 1 ) is equal to 1 . 
This shows that, although an SlxS2xS3 interaction term is 
needed in the log-linear model to accommodate the referral 
structure, an SlxS2 interaction term is not. Specifically, the 
referrals from sources 1 and 2 into source 3 induce a depend- 
ency between SI and S2 in the S3 = 1 cells, but not in the 
S3 = 0 cells. Model 1 1 appropriately accommodates this de- 
pendence structure. 

For our referral scenario 3, we find that the 3-way cross- 
product (equation 2) is 

^•111^-100^010^001 

^011^000^110^-101 

Lglg2|Sl=l + g3l(l -Pl)j>2|Sl=o](l ~P2|Sl = l) 
P2|Sl=lbl(l -P2|Sl = l) + 93l(l -pi)(\ -P2|S1=0)] ' 

so that a 3-way interaction term must be present in the cor- 
responding log-linear model unless either q 13 = 0 or p2isi=i = 
P2isi=o- Further, consideration of the cross-product in the 
S2 = 0 cells 

^101X000 _ Pi(l -yg2JSl=l) + 93i(l -p 2 |si=o) 

^-001^-100 (1 — 93l)Pl(l — / ? 2|Sl = l) 

shows that this is equal to 1 only if q 13 = 0; otherwise, an 
SlxS3 interaction term must also be present in the cor- 
responding log-linear model. With only 7 observed data points, 
it is therefore not possible to fit the appropriate log-linear 
model, which would need to include a 3-way interaction 
term in addition to all three 2-way interaction terms, and 
therefore 8 parameters. 



APPENDIX 2: WinBUGS CODE 

WinBUGS model code for fitting the correct model for referral scenario 3 (standard SlxS2 and S2xS3 interactions plus referral 
of a proportion g 31 of individuals from source 3 into source 1). 

We assume that the 7 observed cell counts have been formatted into a vector, such that x[l] = Xi n, x[2] =x uo , . . , , x[7] = x 00 i . 

model { 

# Split the full multinomial likelihood into two parts 

# First part: multinomial likelihood conditional on being observed : 
x [ 1 : 7 ] ~ dmulti ( prob . obs [1:7] , x . obs ) 

x . obs < - sum ( x [ 1 : 7 ] ) 
for(i in 1:7) { 
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# Probability of being in the ith cell conditional on being observed: 
prob . obs [ i ] < - prob [ i ] / sum ( prob [1:7] ) 

} 

# Second part : binomial likelihood for being observed 
x . obs ~ dbin ( sum . prob, N) 

sum. prob <- sum ( prob [1:7]) # probability of being observed 
xmiss <- N - x . obs # number of missing individuals 

# Specify the 8 cell probabilities directly 

# in terms of intuitive parameters : 

prob[l] <- (pl*p2Sll+ q31* (1-pl) *p2S10) *p3S21 
prob[2] <- pl*p2Sll* (l-p3S21) 

prob [3] <- (pi* (l-p2Sll) +q31* (1-pl) * (l-p2S10) ) *p3S2 0 

prob[4] <- pi* (l-p2Sll) * (l-p3S20) 

prob[5] <- (1-pl) * (l-q31) *p2S10*p3S21 

prob[6] <- (1-pl) *p2S10* (l-p3S21) 

prob[7] <- (1-pl) * (l-q31) * (l-p2S10) *p3S20 

prob[8] <- (1-pl) * (l-p2S10) * (l-p3S20) 

# Vague prior distributions for probabilities : 
pl~dunif (0,1) 

p2S10 ~dunif (0, 1) 
p2Sll~dunif (0,1) 
p3S20 ~dunif (0, 1) 
p3S21~dunif (0,1) 

# Vague prior for proportion referred from Source 3 into Source 1 : 
q31~dunif (0,1) 

# Vague prior for total population size : 
log. x. obs <- log(x.obs) 

log (N) <- logN 

logN ~ dnorm (0, 0 . 0 0 0 1 ) I ( log . x . obs , ) 
} 
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